[DRAFT] Perf: vectorize take_bits (up to 2.5x speedup)#8967
[DRAFT] Perf: vectorize take_bits (up to 2.5x speedup)#8967gstvg wants to merge 1 commit intoapache:mainfrom
Conversation
0475fa3 to
37c1e6b
Compare
|
Hi @Dandandan, do you think this speedup justifies the usage of intrinsics ? Taking into account that it affects not only boolean arrays but every array with nulls, and that they can also be used in other parts of the code, like cmp kernels. Thanks! |
|
The justification to use intrinsics is that currently llvm generates suboptimal vectorization for bitmask packing, see llvm/llvm-project#96395 and llvm/llvm-project#121691 |
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
I forgot to mention that I ran the benchmarks with |
|
Just tested on AMD, also lower improvements: Details
Lower gains in zen 4 than zen 3 is due to llvm/llvm-project#91370, targeting znver3 for Zen4 CPU improves a bit.
|
Which issue does this PR close?
Part of #279, related to #8879
Rationale for this change
EDIT: with
-C target-cpu=sapphirerapids:What changes are included in this PR?
Vectorized take_bits for indices without nulls (https://rust.godbolt.org/z/YnqnqccMK)
Bitmask packing with intrinsics for sse2/avx2 (generic enough to be used in other parts of the code)
Are these changes tested?
Currently no, tests will be added if this PR is decided to move forward